Starting out in R, we tend to use it interactively, running one command at a time. In this section we will look at writing R code that performs repetitive task using “for-loops”, packages up commonly needed sequences of operations in “functions”, and makes decisions using “if-statements”.
Some comments on programming to aid in setting context when learning to program. What can computers do? They do the following:
Store and retrieve data and instructions - for example, variables that were discussed in the introductory course.
Perform arithemtic, including performing comparisons.
Execute stored instructions
A programming language makes these basics accessible in convenient forms. Loops, which we will meet soon, are combinations of counters (arithmetic) and comparisons to perform repeated execution of blocks of instructions.
Programming languages are collections of useful combinations of these structures that are usually presented in a formal style. Learning to use a specific programming language includes learning to read the specific formulation, or syntax (the use of brackets, keywords etc), and how to utilise the language specific structures for your problem.
As you gain experience you will build a mental library of patterns or idioms - small collections of code and concepts that you can apply, with minor modifications, to many problems. We’ll introduce a few patterns that we’ve found useful in a variety of applications.
While R is known as a statistical environment, it is built around a general purpose programming language. This is part of what has made so many extensions possible. The big gains of learning to use this language are especially evident in the early stages of a data analysis problem - importing, cleaning and visualizing the data.
We are using a couple of packages from CRAN in this tutorial, which we can install with install.packages. Don’t run this if you are using our biotraining server, the packages are already installed!
# Install the entire Tidyverse collection of packages with:
#
# install.packages("tidyverse")
#
# or just install packages needed for this section with:
#
# install.packages(c(
# "readr", # read tabular data
# "dplyr" # general data frame manipulation
# ))
One fundamental programming concept is the “if-statement”. An if-statement lets us do something only if some logical value is TRUE, or do one thing if it is TRUE and another if it is FALSE.
In the introductory R day, we used logical vectors to query a data frame. Everything we learned about making and manipulating logical vectors is applicable here, but with the restriction that we need a single logical value, i.e. a logical vector of length 1.
The general pattern of an if statement is:
if (LOGICAL_VALUE) {
THING_TO_DO_IF_TRUE
} else {
THING_TO_DO_IF_FALSE
}
Here is an example:
num <- 37 # 1
if (num > 100) { # 2
cat("greater\n") #
} else { #
cat("not greater\n") # 3
} #
cat("done\n") # 4
# --time-->
The second line of this code uses an if-statement to tell R that we want to make a choice. If the following test is TRUE, the body of the if (i.e., the lines in the curly braces underneath it) are executed. If the test is FALSE, the body of the else is executed instead. Only one or the other is ever executed.
In the example above, the test num > 100 returns the value FALSE, which is why the code inside the if block was skipped and the code inside the else block was run instead.
If-statements don’t have to include an else. If there isn’t one, R simply does nothing if the test is FALSE:
num <- 53 # 1
if (num > 100) { # 2
cat("num is greater than 100\n") #
} #
cat("done\n") # 3
# --time-->
We can also chain several tests together when there are more than two options. As an example, here is a function that returns the sign of a number:
if (num > 0) { # line 1
print(1) # line 2
} else if (num == 0) { # line 3
print(0) # line 4
} else {
print(-1) # line 5
}
num <- -3
num <- 0
num <- 2/3
Which lines above are executed, and in what order, for each value of num?
Functions are blocks of code stored in a special structure. They are modelled on the concept of a mathematical function - they receive inputs (arguments), perform operations on the inputs, and return results. We saw some examples of functions in the introduction: mean, sum, read_csv
Starting out in R mostly involves using functions that other people had written. We’re now going to see how to create our own functions. The general pattern of a function is:
FUNCTION_NAME <- function(ARGUMENT_NAME1, ARGUMENT_NAME2, ...) {
FUNCTION_BODY
}
Here is an example:
fahr_to_kelvin <- function(temp) {
(temp-32) * (5/9) + 273.15
}
We define fahr_to_kelvin by assigning it a function. The list of argument names are containted within parentheses. Next, the body of the function–the statements that are executed when it runs–is contained within curly braces ({}). The statements in the body have been indented by four spaces, which makes the code easier to read but does not affect how the code operates.
Note that there is a very important difference between defining a function and calling, or running a function.
When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. The expression inside the function is evaluated and the result is “returned” back to whoever asked for it.
Let’s try running our function. Calling our own function is no different from calling any other function:
# freezing point of water
fahr_to_kelvin(32)
# boiling point of water
fahr_to_kelvin(212)
Here are some other equivalent ways to write this function.
The braces are not actually needed if the function only contains one statement.
fahr_to_kelvin <- function(temp) (temp-32) * (5/9) + 273.15
The body of the function within the {} can contain multiple lines of code, “statements”. Only the last line is returned.
fahr_to_kelvin <- function(temp) {
kelvin <- (temp-32) * (5/9) + 273.15
kelvin
}
return can be used to return from a function immediately. Further code will not be run. I recommend always using explicit return statements.
fahr_to_kelvin <- function(temp) {
kelvin <- (temp-32) * (5/9) + 273.15
return(kelvin)
plot(1:10)
}
Statements in the function may be split over several lines, so long as it is clear that the statement is incomplete due to an unclosed bracket or an operator in need of a right hand argument.
fahr_to_kelvin <- function(temp) {
kelvin <-
(temp-32) * (5/9) +
273.15
return(kelvin)
}
Spacing and layout is largely for our own sanity.
fahr_to_kelvin<-
function(
temp){(temp
-32 )*( 5
/9 )+
273.15}
Beware! however of accidentally finishing a statement early.
fahr_to_kelvin_broken <- function(temp) {
(temp-32) * (5/9)
+ 273.15
}
fahr_to_kelvin_broken(212)
The first line in the body of the function is computed then discarded. The second line, + 273.15, is valid R so we don’t even get an error.
Finally, the computer doesn’t understand or care what names we give to functions, arguments, or variables.
charmander <- function(bulbasaur) {
mew <- (bulbasaur-32) * (5/9) + 273.15
mew
}
Now that we’ve seen how to turn Fahrenheit into Kelvin, it’s easy to turn Kelvin into Celsius:
kelvin_to_celsius <- function(temp) {
temp - 273.15
}
#absolute zero in Celsius
kelvin_to_celsius(0)
What about converting Fahrenheit to Celsius? We could write out the formula, but we don’t need to. Instead, we can compose the two functions we have already created:
fahr_to_celsius <- function(temp) {
temp_k <- fahr_to_kelvin(temp)
temp_c <- kelvin_to_celsius(temp_k)
temp_c
}
# freezing point of water in Celsius
fahr_to_celsius(32.0)
This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here–typically half a dozen to a few dozen lines–but they shouldn’t ever be much longer than that, or the next person who reads it won’t be able to understand what’s going on.
Function names and argument names are important cues to what code is doing.
Functions also help manage code complexity by cleaning up their contents after completion.
You might have noticed that all of these functions have an argument called temp. Why hasn’t this caused confusion and chaos? The answer is that arguments to a function and variables defined in the body of a function are local to a call to that function. Within the body of a function, R code can see variables from its own local environment, and variables from the global environment, but not variables local to other function calls. While running code R maintains a stack of local environments. When a call to a function returns, all of its local arguments and variables disappear.
Tip: There are several ways to see this in action.
Scatter cats through your functions showing the values of arguments and variables at various points.
Insert a call to browser() in one of your functions. This will pause execution and let you interact with the local environment in a function. Once in the browser, type “help” for a list of commands. Besides examining variables, you can step through the function from the point it paused with “n” or, at a finer grain, step into functions it calls with “s”. When you are done type “f”.
You can debug an existing function using debugonce. This is like temporarily adding a call to browser() at the top of the function.
# debugging a function
debugonce(fahr_to_celsius)
fahr_to_celsius(212)
Write a function to calculate the length of the hypotenuse of a right angled triangle using Pythagorus’s rule, given the lengths of the other sides.
Hint: sqrt calculates the square root of a number.
Testing your code is important. Invent a test case for your code consisting of:
Confirm that your function works as expected.
There are three common approaches to repetitive tasks in R:
Method 1 is error prone, and also hard to fix if you find a mistake in your code.
We will now look at method 2, using for-loops. The general pattern of a loop is:
for(VARIABLE_NAME in VECTOR) {
FOR_LOOP_BODY
}
Let’s look at an example. Suppose we have a calculation we need to perform on a series of numbers.
i <- 10
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 20
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 30
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 40
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 50
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
Apart from the i <- ... lines, the code is the same each time. This can be re-written as a for-loop:
for(i in c(10,20,30,40,50)) { # 1
cat("i is",i,"\n") # 2 4 6 8 10
cat("i squared is",i*i,"\n") # 3 5 7 9 11
} #
#
cat("done\n") # 12
# --order-of-execution-->
We can name the loop variable anything we like. in is part of the for syntax. The body of the loop is enclosed in curly braces { }. For a single-line loop body the braces aren’t needed, but it is good practice to always include them.
The loop variable i is assigned each element in the vector in turn, and then the loop body runs. There’s no possibility here or making a mistake when copying and pasting thhe code, and if we later discover we need to change this code, the for-loop version will be much easier to change. It will also be possible to run this code a different number of times by changing the vector used.
cats or prints to the loop body to check what is going on.myvec <- c(10,20,30,40)
total <- 0
for(item in myvec) {
total <- total + item
}
total
myvec?myvec <- c(10,20,30,40)
for(index in 1:4) {
myvec[index] <- myvec[index] * 2
}
1*2*3*4* ... *10.numbers <- 1:10
... your code here ...
Let’s say we want to examine the quality of some FASTQ files, which contain reads from a DNA sequencing machine. An experiment has been performed over several days, and we want to run a program called “fastqc” on each of the FASTQ files.
FASTQ is a text format containg a series of DNA sequences and associated quality information. Examine “Day0.fastq” in the “r-progtidy-files” folder.
(This data is a very small portion of the reads from an experiment inducing pluripotency in a mouse cell line, GSE70022.)
Software outside of R is normally run from a “terminal” window, which is rather like the console in RStudio. You type in a command and the operating system runs it for you. The commands you type in are interpreted by something called a “shell”, because it acts like a shell around the operating system. For more information, see the Software Carpentry course on the Unix shell.
In RStudio, you can open a terminal using “Tools/Terminal/New Terminal”. Try this and type in:
uptime
The “uptime” command tells you how long the computer has been running for. R can give a command to the shell using the system function.
system("uptime")
Here is how we want to run FastQC for day 0. To R the command has no meaning beyond being a character string. It will be interpreted by the shell, which is external to R. The shell will then run FastQC for us.
system("fastqc --extract --outdir . r-progtidy-files/Day0.fastq")
In the “Files” tab of the bottom right pane in RStudio, you should see that some new files and a directory have been created. Click on “Day0_fastqc.html” to examine it. Also examine the file in the “Day0_fastqc” folder called “summary.txt”.
If you don’t have FastQC, or it failed to run for some reason: The expected output files can be found in the folder “r-progtidy-files/fastqc-output”.
Typing out the fastqc command for each day will get repetitive. Let’s use a for-loop to automate this task. The paste0 function lets us “paste” together character strings, so we can use that to construct the commands to run, like this:
# construct a command to run
day <- 0
command <- paste0("fastqc --extract --outdir . r-progtidy-files/Day", day, ".fastq")
command
We want to do this for each day, and then use system to run the resulting command:
# run fastqc on each of the files
days <- c(0,4,7,10,15,20)
for(day in days) {
command <- paste0("fastqc --extract --outdir . r-progtidy-files/Day", day, ".fastq")
cat("Running the command:", command, "\n")
system(command)
}
Note: If you weren’t able to run FastQC earlier, the expected output files can be found in the folder “r-progtidy-files/fastqc-output”. You will need to adjust the filenames appropriately in the R code below.
The summary.txt files are in “tab separated value” format. This is similar to comma separated value format we’ve seen in the introductory R day, but instead of using a comma it uses a special character called a tab to delimit columns. Tabs show up as variable amounts of space in a text editor. In R, they can be written "\t".
In base R we could use the function read.delim to read this file. It’s quite similar to read.csv. However, let’s use the Tidyverse package readr to load the file. To use readr we first need to load it with library.
library(readr)
We can now read the file like this:
read_tsv("r-progtidy-files/fastqc-output/Day0_fastqc/summary.txt")
Oh! There aren’t column headings in this file. We need to tell read_tsv this.
read_tsv("r-progtidy-files/fastqc-output/Day0_fastqc/summary.txt", col_names=FALSE)
We should tidy this up a bit before using it. Columns should have meaningful names, and the PASS/WARN/FAIL grading looks like it should be a factor:
filename <- "r-progtidy-files/fastqc-output/Day0_fastqc/summary.txt"
sumtab <- read_tsv(filename, col_names=FALSE)
colnames(sumtab) <- c("grade", "test", "file")
sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
sumtab
We expect to have to examine many FastQC reports in future. It will be convenient to have this as a function!
filename is going to be different each time we use this code, so it wants to be an argument to the function. The rest of the code goes in the body of the function. The returned value will be sumtab.
load_fastqc <- function(filename) {
sumtab <- read_tsv(filename, col_names=FALSE)
colnames(sumtab) <- c("grade", "test", "file")
sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
sumtab
}
load_fastqc("r-progtidy-files/fastqc-output/Day0_fastqc/summary.txt")
We now want to load each of the summary.txt files we created earlier. One perfectly valid way would be to use a for-loop. However here we’re going to use an “apply”-style function. “apply”-style functions are functions that call some other function repeatedly on subsets of a data set. R has a variety of functions with names such as apply and tapply, and several other functions where the idea is similar such as aggregate. apply itself is for use with matrices and produces a vector as output. Other functions in this family have different input and output types.
The flavour of apply we need here is lapply, which takes a vector and calls a function on each element in turn of that vector (i.e. the “subsets” here are simply individual elements). The result is a list containing the returned values, hence the name lapply.
First let’s work out the names of the files we want to load.
days <- c(0,4,7,10,15,20)
filenames <- file.path("r-progtidy-files/fastqc-output", paste0("Day", days, "_fastqc/summary.txt"))
filenames
Now we can use lapply to apply our function to each of the filenames.
sumtabs <- lapply(filenames, load_fastqc)
lapply is one of the base R functional programming tools. It stands for loop-apply and performs a specific class of loops without the user needing to write the loop structure. It can lead to clear code, when you are used to seeing it.
This example is equivalent to:
sumtabs <- list()
for (idx in 1:length(filenames)) {
sumtabs[[idx]] <- load_fastqc(filenames[idx])
}
We now have a list of data frames. Lists are a little different from the vectors we usually work with. See section 20.5 in the “R for data science” book for a comprehensive explanation. Here are some ways to examine this object:
Basic differences are that lists can contain things of different types and different sizes. In this case, we have a list of data.frames, which are not a basic R type (a vector of something)
sumtabs
class(sumtabs)
length(sumtabs)
str(sumtabs)
When we print sumtabs on the console, it gives us a hint how to access individual elements: using the [[ ]] syntax:
sumtabs[[1]]
sumtabs[[2]]
This is an ungainly data structure, we would like to turn it into a single a single big data frame. The dplyr package provides a function to do this:
library(dplyr)
bigtab <- bind_rows(sumtabs)
bigtab
table(bigtab$test, bigtab$grade)
Base R has a similar idiom to do the same task. It is more flexible, but does less checking. I use it to bind vectors into a data.frame, which bind_rows refuses to do.
bigtab <- do.call(rbind, sumtabs)
What we’ve just done is:
Create a custom reader tool that sets up each file inside R
Use a looping tool to apply that function to a collection of file names
Transform the results produced by the looping tool into a data.frame that we can use for other purposes.
Our load_fastqc function will currently fail with an error if the file it is passed does not exist.
load_fastqc("nosuchfile.txt")
Depending on how we intend to use it, we might instead want the function to issue a warning and return NULL.
load_fastqc <- function(filename) {
# Check arguments are sane
if (!file.exists(filename)) {
warning("No such file: ", filename)
return(NULL)
}
# Load and tidy data
sumtab <- read_tsv(filename, col_names=FALSE)
colnames(sumtab) <- c("grade", "test", "file")
sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
sumtab
}
load_fastqc("nosuchfile.txt")
Checking input arguments before proceeding is a common pattern when writing R functions.
Having developed some useful functions, we might want to re-use them in a future project or share them with a friend. We can store R code in a file, usually with extension “.R”, and load it with the source function.
Put your load_fastqc function in a file called “fastqc.R”. It uses the readr library, so be sure to load the library in the file as well.
# fastqc.R file should contain:
library(readr)
load_fastqc <- function(filename) {
sumtab <- read_tsv(filename, col_names=FALSE)
colnames(sumtab) <- c("grade", "test", "file")
sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
sumtab
}
# From the console:
source("fastqc.R")
Everything in the file runs as though it were typed on the console.
Similarly, if you are going to be doing several different things with a data set, you could write a .R file to load the data into a tidy form, and several other .R scripts to do various different things with it.
What other R code from this lesson could we put in a .R file? Or a .Rmd file?
How should we break up a large project (paper/thesis/software package) into files?
What about managing data files?
How should we share a project with others?
Importing microscopy data and generating a report.
Packages are the next step up from sourcing .R files. They let you write code that other people can install and then load with library.
Hadley Wickham has a great package called devtools that takes a lot of the pain out of package writing. Also useful is the package usethis automates creation of a package directory and basic files, and can set up various other parts of a package.
Packages generally contain documentation and example code for all functions, and one or more vignettes describing how to use the package. Function documentation is usually written as specially formatted comments before each function using roxygen2.
library(devtools)
library(usethis)
# Create an empty package template
create_package("mypack", open=FALSE)
# ... Edit mypack/DESCRIPTION file
# ... Write .R files in mypack/R folder
# For example create a file mypack/R/hello.R containing:
#----
#' @export
hello <- function() {
cat("Hello, world.\n")
}
#----
# Load package. Use this during development.
load_all("mypack")
hello()
# Build package documentation, converting inline documentation to .Rd files using roxygen2.
# Update NAMESPACE file (lists functions the package exports for public consumption).
document("mypack")
# Check for common problems and missing documentation.
# This also automatically runs document( ).
# A CRAN package must pass all checks.
check("mypack")
create_package creates a .Rproj project file for the package. If you open this project, RStudio changes to the package directory, and there is a new “Build” tab. This provides tools that parallel what Hadley’s devtools package provides with load_all and check, you can decide which you prefer! Old-school R developers may also use R CMD commands from the terminal (try typing R -h in the “Terminal” tab).
Packages are most easily distributed by placing them on GitHub or GitLab. They can then be installed by others using the devtools functions install_github or install_gitlab. Once a package is mature and well documented, it can be submitted to CRAN or Bioconductor.
# To install from GitHub:
devtools::install_github("myusername/mypack")
sessionInfo()